# FIR Filter Design and Implementation Project Report

Paul Nieves · Prof. Zhang · Advanced VLSI Design · March 14, 2024

# Table of Contents

- Abstract
- 1. MATLAB FIR Filter Design and Simulation
  - 1.1 MATLAB Implementation
  - 1.2 Simulation Setup
- 2. Filter Frequency Response and Quantization Effects
  - 2.1 Frequency Response Analysis
  - 2.2 Quantization and Overflow Management
- 3. FIR Filter Architecture
  - 3.1 Architectural Overview
  - 3.2 Verilog Code Structure
  - 3.3 Filter Coefficients and Polyphase Decomposition
- 4. Hardware Implementation Results
  - 4.1 Implementation Details
  - 4.2 Area Utilization
  - 4.3 Clock Frequency and Timing
  - 4.4 Power Estimation
  - 4.5 Summary Table
  - 4.6 Filter Simulation Results
- 5. Further Analysis and Conclusion
  - 5.1 Performance vs. Complexity
  - 5.2 Quantization Impact
  - 5.3 Scalability
  - 5.4 Conclusion

# **Abstract**

This project focuses on the design and implementation of a low-pass FIR filter using both MATLAB and Verilog. The design aims to filter out noise from a sine wave signal while meeting stringent specifications, such as a transition region between  $0.2\pi$  and  $0.23\pi$  rad/sample and a stopband attenuation of at least 80 dB. Various architectural approaches are explored, including pipelining and parallel processing, to achieve high performance on FPGA hardware.

# 1. MATLAB FIR Filter Design and Simulation

### 1.1 MATLAB Implementation

MATLAB was employed in two primary aspects of this project. First, for FIR filter design, MATLAB's built-in functions (as described in the MathWorks FIR filter-design documentation) were used to construct a 102-tap low-pass FIR filter. The filter was initially designed in its ideal (un-quantized) form to meet the desired specifications. Second, a MATLAB script named sin\_gen.m was developed to generate a noisy sine wave. This script creates a clean sine wave at 1 kHz, adds white Gaussian noise based on a specified Signal-to-Noise Ratio (SNR), and then scales the signal to fit within the range of a 16-bit integer. The binary representation of the scaled signal is saved to a file, providing a test input for simulation purposes.
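The stimulus-generation steps described above (clean 1 kHz sine, additive white Gaussian noise at a chosen SNR, scaling to 16-bit integers, binary text output) can be sketched in Python as follows. This is only an illustrative sketch and not the contents of sin\_gen.m: the 47 kHz sample rate, the 20 dB SNR, the sample count, the fixed seed, and the one-word-per-line two's-complement format are all assumptions made for the example.

```python
import math
import random

def noisy_sine(n_samples, f_sig=1000.0, f_s=47000.0, snr_db=20.0, seed=0):
    """Return a sine at f_sig plus white Gaussian noise at the requested SNR.

    f_s, snr_db and seed are assumed values; the report does not state them.
    """
    rng = random.Random(seed)
    clean = [math.sin(2.0 * math.pi * f_sig * n / f_s) for n in range(n_samples)]
    p_signal = sum(v * v for v in clean) / n_samples
    # Noise standard deviation that gives the requested signal-to-noise ratio.
    sigma = math.sqrt(p_signal / 10.0 ** (snr_db / 10.0))
    return [v + rng.gauss(0.0, sigma) for v in clean]

def to_int16(samples):
    """Scale so the largest magnitude maps to 32767, then round to integers."""
    peak = max(abs(v) for v in samples)
    return [int(round(v / peak * 32767)) for v in samples]

def to_twos_complement(value, bits=16):
    """Format a signed integer as a fixed-width two's-complement bit string."""
    return format(value & ((1 << bits) - 1), "0{}b".format(bits))

samples = to_int16(noisy_sine(1024))
lines = [to_twos_complement(v) for v in samples]
# Writing the stimulus file is left to the caller, for example:
# with open("sine_input.txt", "w") as f:   # hypothetical file name
#     f.write("\n".join(lines))
```

Scaling by the observed peak guarantees that no sample clips, at the price of a gain that depends on the particular noise realization.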

### 1.2 Simulation Setup

The generated noisy sine wave serves as the test input for the filter model in simulation. In ModelSim (or another FPGA simulation tool), this input is fed to the Verilog filter implementations to verify that the designed FIR filter successfully removes the noise while preserving the desired signal.



# 2. Filter Frequency Response and Quantization Effects

### 2.1 Frequency Response Analysis

The ideal, un-quantized filter exhibits the expected passband along with a steep roll-off in the transition region. After quantizing the filter coefficients, the frequency response was reevaluated. Although quantization introduces minor deviations from the ideal response, careful scaling and rounding strategies have ensured that the stopband attenuation remains at or above 80 dB.

### 2.2 Quantization and Overflow Management

The filter coefficients were quantized to a signed 32-bit representation, and analysis showed that this precision is sufficient to closely match the frequency response achieved with floating-point arithmetic. Additionally, techniques such as proper scaling of the input, output, and intermediate data were implemented to prevent arithmetic overflow during the multiply-accumulate (MAC) operations.



Figure 2.2: Frequency Response Comparison Original vs Quantized FIR Filter
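As a rough illustration of the quantization and overflow analysis, the Python sketch below quantizes a set of coefficients to signed 32-bit integers and estimates the accumulator width needed for a worst-case input. Several things here are assumptions rather than details from the report: the Q1.31 scaling (31 fractional bits), the 16-bit input width used in the bound, and the Hamming-windowed sinc prototype, which only stands in for the MATLAB-designed 102-tap filter and does not meet the 80 dB specification.

```python
import math

def windowed_sinc(num_taps, cutoff):
    """Hamming-windowed sinc low-pass; cutoff is in units of pi rad/sample."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = cutoff if t == 0 else math.sin(math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def quantize(coeffs, frac_bits=31, word_bits=32):
    """Round to fixed point and saturate to the signed word range."""
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return [max(lo, min(hi, int(round(c * (1 << frac_bits))))) for c in coeffs]

def magnitude_db(coeffs, omega):
    """Magnitude of the frequency response at omega rad/sample, in dB."""
    re = sum(c * math.cos(omega * n) for n, c in enumerate(coeffs))
    im = -sum(c * math.sin(omega * n) for n, c in enumerate(coeffs))
    return 20.0 * math.log10(max(math.hypot(re, im), 1e-300))

def accumulator_bits(q_coeffs, input_bits=16):
    """Signed width that holds sum(|h|) times the largest input magnitude."""
    worst = sum(abs(c) for c in q_coeffs) * (1 << (input_bits - 1))
    return worst.bit_length() + 1  # one extra bit for the sign

taps = windowed_sinc(102, 0.215)                 # stand-in prototype, not the report's filter
q_taps = quantize(taps)
restored = [q / float(1 << 31) for q in q_taps]  # back to floating point for comparison
max_error = max(abs(a - b) for a, b in zip(taps, restored))
print("largest coefficient error:", max_error)
print("accumulator width needed :", accumulator_bits(q_taps), "bits")
```

Sizing the accumulator from the sum of coefficient magnitudes is a conservative bound: it guarantees no overflow for any input sequence, whereas a narrower accumulator might suffice for typical signals.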

# 3. FIR Filter Architecture

#### 3.1 Architectural Overview

To meet performance requirements and optimize for FPGA implementation, multiple FIR filter architectures were explored:

#### • Directly Implemented (Traditional) Architecture:

The traditional FIR filter design follows a straightforward Multiply-Accumulate (MAC) approach, where each input sample is delayed, multiplied by a corresponding coefficient, and accumulated sequentially. While simple and easy to implement, this architecture is not optimized for high-speed applications due to its long critical path and lack of pipelining.



Figure 3: Directly Implemented (Traditional) Architecture
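A behavioural model of the direct-form structure can make the long multiply-accumulate chain explicit. The Python sketch below is a software reference under simple assumptions (integer samples and coefficients, zero initial state) and is not a transcription of the Verilog module.

```python
def fir_direct(coeffs, samples):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    delay = [0] * len(coeffs)            # delay line, newest sample at index 0
    out = []
    for x in samples:
        delay = [x] + delay[:-1]         # shift the new sample into the delay line
        acc = 0
        for h, d in zip(coeffs, delay):  # one unbroken multiply-accumulate chain
            acc += h * d
        out.append(acc)
    return out

# An impulse at the input reproduces the coefficients at the output.
impulse_response = fir_direct([3, -1, 2], [1, 0, 0, 0, 0])
```

In hardware the inner loop corresponds to a chain of adders evaluated within a single clock cycle, which is why the critical path grows with the number of taps.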

#### • Pipelined Architecture:

The pipelined design (see module fir\_pipeline) breaks the MAC operation into stages, each handling a tap. This architecture improves the throughput by allowing operations to be overlapped across several clock cycles.



Figure 4: FIR Pipelined
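One common way to place a register between every tap is the transposed-form FIR, in which each clock cycle performs only one multiplication and one addition per stage. The Python sketch below models that structure cycle by cycle. It is an assumption that this resembles fir\_pipeline; the report does not say whether the module uses the transposed form or inserts registers into a direct-form adder chain.

```python
def fir_transposed(coeffs, samples):
    """Cycle-by-cycle model of a transposed-form FIR with one register per tap."""
    regs = [0] * (len(coeffs) - 1)   # partial-sum registers between stages
    out = []
    for x in samples:
        products = [h * x for h in coeffs]   # every tap sees the current input
        out.append(products[0] + (regs[0] if regs else 0))
        # Next state: each register captures its tap product plus the
        # partial sum held in the register one stage upstream.
        nxt = []
        for k in range(len(regs)):
            upstream = regs[k + 1] if k + 1 < len(regs) else 0
            nxt.append(products[k + 1] + upstream)
        regs = nxt
    return out

def fir_reference(coeffs, samples):
    """Plain convolution sum used to check the register model."""
    return [sum(coeffs[k] * samples[n - k] for k in range(min(len(coeffs), n + 1)))
            for n in range(len(samples))]

stimulus = [1, -2, 3, 0, 5, 7, -1]
assert fir_transposed([4, 3, 2, 1], stimulus) == fir_reference([4, 3, 2, 1], stimulus)
```

The output matches the direct form sample for sample, but the longest combinational path no longer depends on the filter length.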

#### • Parallel Processing:

Two parallel architectures were implemented:

L=2 Parallel Processing: The filter is partitioned into two sub-filters, denoted as H0 and H1, which process alternating input samples separately (see module FIR\_Filter\_L2\_Top). These sub-filters use the traditional FIR filter architecture.

The outputs of H0 and H1 are then combined to reconstruct the final filtered signal, effectively doubling the throughput.



Figure 5: Reduced-complexity 2-parallel FIR filter
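The reduced-complexity 2-parallel structure can be checked against an ordinary FIR filter in software. The Python sketch below follows the standard fast-FIR formulation, which uses three half-length sub-filters (H0, H1, and H0+H1) instead of four. It is a behavioural model under assumed conditions (integer data, even-length input, zero initial state), not the FIR\_Filter\_L2\_Top source.

```python
def conv(h, x):
    """Causal convolution truncated to len(x) outputs."""
    return [sum(h[j] * x[k - j] for j in range(min(len(h), k + 1)))
            for k in range(len(x))]

def delay1(seq):
    """Delay by one block, which equals two samples at the input rate."""
    return [0] + seq[:-1]

def fir_2parallel(h, x):
    """Fast-FIR 2-parallel filter; len(x) must be even."""
    h = list(h) + [0] * (len(h) % 2)       # pad to an even number of taps
    h0, h1 = h[0::2], h[1::2]              # polyphase components of the filter
    x0, x1 = x[0::2], x[1::2]              # even and odd input samples
    h01 = [a + b for a, b in zip(h0, h1)]
    x01 = [a + b for a, b in zip(x0, x1)]
    a = conv(h0, x0)                       # H0 * X0
    b = conv(h1, x1)                       # H1 * X1
    c = conv(h01, x01)                     # (H0 + H1) * (X0 + X1)
    y0 = [p + q for p, q in zip(a, delay1(b))]    # even outputs
    y1 = [r - p - q for r, p, q in zip(c, a, b)]  # odd outputs
    out = []
    for even, odd in zip(y0, y1):
        out += [even, odd]
    return out

taps = [1, 2, 3, 4, 5, 6]
stimulus = [1, -1, 2, 0, 3, 5, -4, 2]
assert fir_2parallel(taps, stimulus) == conv(taps, stimulus)
```

Two output samples are produced per block, and the third sub-filter replaces the two cross terms H0·X1 and H1·X0 at the cost of a pre-adder and two post-subtractions.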

L=3 Parallel Processing: Similarly, the filter is divided into three sub-filters, denoted as H0, H1, and H2 (see module FIR\_Filter\_L3\_Top). Each sub-filter processes every third input sample independently using the traditional FIR filter architecture. This approach increases throughput for higher data rates while distributing computational load across the three sub-filters.



Figure 6: Reduced-complexity 3-parallel FIR filter
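A software model of the reduced-complexity 3-parallel structure is sketched below in Python. It follows the standard fast-FIR formulation with six third-length sub-filters (H0, H1, H2, H0+H1, H1+H2, and H0+H1+H2) and checks the result against an ordinary convolution. The model assumes integer data, an input length that is a multiple of three, and zero initial state; it is not taken from the FIR\_Filter\_L3\_Top source.

```python
def conv(h, x):
    """Causal convolution truncated to len(x) outputs."""
    return [sum(h[j] * x[k - j] for j in range(min(len(h), k + 1)))
            for k in range(len(x))]

def delay1(seq):
    """Delay by one block, which equals three samples at the input rate."""
    return [0] + seq[:-1]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def fir_3parallel(h, x):
    """Fast-FIR 3-parallel filter; len(x) must be a multiple of three."""
    h = list(h) + [0] * (-len(h) % 3)      # pad taps to a multiple of three
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    p0 = conv(h0, x0)
    p1 = conv(h1, x1)
    p2 = conv(h2, x2)
    p01 = conv(add(h0, h1), add(x0, x1))
    p12 = conv(add(h1, h2), add(x1, x2))
    p012 = conv(add(add(h0, h1), h2), add(add(x0, x1), x2))
    t01 = sub(p01, p1)                     # H0X0 + H0X1 + H1X0
    t12 = sub(p12, p1)                     # H1X2 + H2X1 + H2X2
    u = sub(p0, delay1(p2))                # H0X0 - z^-1 H2X2
    y0 = add(u, delay1(t12))               # outputs y[3k]
    y1 = sub(t01, u)                       # outputs y[3k+1]
    y2 = sub(sub(p012, t01), t12)          # outputs y[3k+2]
    out = []
    for a, b, c in zip(y0, y1, y2):
        out += [a, b, c]
    return out

taps = [1, 2, 3, 4, 5, 6, 7]
stimulus = [1, -1, 2, 0, 3, 5, -4, 2, 6, 0, -3, 8]
assert fir_3parallel(taps, stimulus) == conv(taps, stimulus)
```

A plain polyphase implementation would need nine sub-filters for L=3; sharing the products above reduces this to six, with extra pre- and post-addition logic.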

#### • Combined Pipelining and Parallel Processing:

The FIR\_Filter\_L3\_Pipelined\_Top module integrates both pipelining and L=3 parallel processing for even higher performance. Unlike the standalone L=3 architecture, the subfilters in this implementation use a pipelined version of the traditional FIR filter, reducing critical path delays and allowing for higher clock frequencies.

#### 3.2 Verilog Code Structure

The Verilog code is organized into modular components:

- **Delay Line Implementation:** Used in every module to store input samples.
- **Coefficient Loading:** Filter coefficients are read from external files (e.g., "lpFilterTapsBinarySigned.txt") during initialization.
- **MAC Operation:** Each tap multiplies the corresponding delay sample with its coefficient and accumulates the result, with intermediate pipeline registers used to support high clock frequencies.

### 3.3 Filter Coefficients and Polyphase Decomposition

In the parallel processing architectures, the FIR filter coefficients are split into sub-filters H0, H1 and H2 (for L=3) or H0 and H1 (for L=2). These sub-filters each use the traditional FIR filter architecture and are obtained through polyphase decomposition.

#### Filter Length and Sub-Filter Coefficients

- The original FIR filter has N coefficients.
- In L=2 parallel processing, the coefficients are split into two sub-filters:
  - H0 contains the even-indexed coefficients: h[0], h[2], h[4], ...
  - H1 contains the odd-indexed coefficients: h[1], h[3], h[5], ...
  - The effective filter length of each sub-filter is N/2.
- In L=3 parallel processing, the coefficients are split into three sub-filters:
  - H0 contains coefficients at indices 0, 3, 6, ...
  - H1 contains coefficients at indices 1, 4, 7, ...
  - H2 contains coefficients at indices 2, 5, 8, ...
  - The effective filter length of each sub-filter is N/3.
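The index pattern above maps directly onto strided slicing. The short Python sketch below is illustrative only: it uses tap indices as placeholder values rather than the real coefficients, and shows the sub-filter lengths that result for the 102-tap filter.

```python
def polyphase(coeffs, factor):
    """Split coefficients into `factor` sub-filters: sub-filter k takes h[k], h[k+L], ..."""
    return [coeffs[k::factor] for k in range(factor)]

indices = list(range(102))               # placeholder values: the tap indices themselves
two_way = polyphase(indices, 2)          # H0, H1
three_way = polyphase(indices, 3)        # H0, H1, H2

print([len(sub) for sub in two_way])     # lengths of H0 and H1
print([len(sub) for sub in three_way])   # lengths of H0, H1 and H2
print(three_way[1][:3])                  # first indices that land in H1 for L=3
```

Because 102 is divisible by both 2 and 3, every sub-filter has the same length; for other filter lengths the shorter sub-filters would need zero padding.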

#### **Coefficient Generation and Binary Representation**

The coefficients are generated using a Python script (floatmakercoeficent.py), which:

1. Reads the full set of filter coefficients.
2. Applies polyphase decomposition based on the parallelization factor (L=2 or L=3).
3. Computes additional coefficient sums for optimized hardware implementation:
   - H0+H1
   - H1+H2 (for L=3)
   - H0+H1+H2 (for L=3)
4. Converts the coefficients into a 32-bit binary representation and stores them in text files for FPGA implementation.
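The steps listed above could look roughly like the following Python sketch. It is not the contents of floatmakercoeficent.py: the function names, the dictionary keys, and the assumption that each coefficient is stored as one 32-bit two's-complement string are hypothetical choices made for illustration, and the input is taken to be already quantized to integers.

```python
def to_bin32(value):
    """Encode a signed integer as a 32-bit two's-complement bit string."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 bits: %d" % value)
    return format(value & 0xFFFFFFFF, "032b")

def from_bin32(bits):
    """Decode a 32-bit two's-complement bit string back to a signed integer."""
    raw = int(bits, 2)
    return raw - (1 << 32) if raw & (1 << 31) else raw

def coefficient_sets(q_coeffs, factor):
    """Polyphase sub-filters plus the pre-added sets used by the parallel designs."""
    padded = list(q_coeffs) + [0] * (-len(q_coeffs) % factor)
    subs = [padded[k::factor] for k in range(factor)]
    sets = {"H%d" % k: sub for k, sub in enumerate(subs)}
    sets["H0+H1"] = [a + b for a, b in zip(subs[0], subs[1])]
    if factor == 3:
        sets["H1+H2"] = [a + b for a, b in zip(subs[1], subs[2])]
        sets["H0+H1+H2"] = [a + b for a, b in zip(sets["H0+H1"], subs[2])]
    return sets

example = coefficient_sets([1, -2, 3, 4, 5, -6], 3)
encoded = {name: [to_bin32(v) for v in values] for name, values in example.items()}
```

Summing sub-filters can exceed the 32-bit range when individual coefficients are close to full scale, which is why `to_bin32` raises an error instead of wrapping silently.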

By reducing the length of each sub-filter, parallel processing decreases the latency per filter operation and increases throughput. However, it requires additional summation logic to combine the sub-filter outputs correctly.

# 4. Hardware Implementation Results

#### 4.1 Implementation Details

The FIR filter designs were synthesized with Synopsys Design Compiler targeting the gscl45nm technology library. Each architecture, from the simple pipelined design to the more advanced parallelized versions, was synthesized separately. The resulting synthesis reports provide the metrics that illustrate the trade-offs between area, speed, and power consumption.

#### 4.2 Area Utilization

The synthesis reports provided key insights into the area utilization of each design. For the pipelined design (module *fir\_pipeline*), the report shows a modest total cell area of approximately 811 units, with 276 ports, 387 nets, and a balanced distribution of 64 combinational and 64 sequential cells. In contrast, the *FIR\_Filter\_L3\_Pipelined\_Top* design exhibits a much larger cell area of about 13,831 units, reflecting the increased complexity from extensive parallel processing with 3,600 ports and 6,856 nets. The intermediate designs, *FIR\_Filter\_L2\_Top* and *FIR\_Filter\_L3\_Top*, show total cell areas of approximately 1,763 and 5,624 units, respectively, with corresponding increases in port and net counts.

### 4.3 Clock Frequency and Timing

All architectures were synthesized with a defined clock period of 21276.000 ns, which corresponds to an effective clock frequency of approximately 47 kHz. Detailed timing reports indicate substantial setup margins alongside positive hold and pulse width slacks, ensuring robust and reliable operation. The timing metrics for the different architectures are as follows:

| Architecture             | Worst Negative Slack (WNS) | Worst Hold Slack (WHS) | Worst Pulse Width Slack (WPWS) |
|--------------------------|----------------------------|------------------------|--------------------------------|
| FIR Pipeline             | 21170.4 ns                 | 0.038 ns               | 10637.468 ns                   |
| L=2 Parallel Processing  | 21168.154 ns               | 0.044 ns               | 10637.468 ns                   |
| L=3 Parallel Processing  | 21166.473 ns               | 0.024 ns               | 10637.468 ns                   |
| Combined Pipelined & L=3 | 21165.096 ns               | 0.019 ns               | 10637.468 ns                   |

Table 1: Timing Summary for All Architectures

The large positive setup slack (reported in the WNS column) across all designs confirms that the timing constraints are met with significant headroom, while the positive hold slack shows that no hold violations remain. The pulse width slack is identical across architectures, which is expected because this check depends on the clock waveform rather than on the data path.
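The arithmetic behind these figures can be reproduced with a few lines of Python. The sketch below converts the clock period to a frequency and subtracts each setup slack from the period to get a first-order estimate of the shortest period the synthesized netlist could meet. That estimate is an assumption-laden illustration rather than a reported result: the synthesis tool stops optimizing once the constraint is met, so re-synthesis with a tighter constraint would be needed to find the real limit.

```python
CLOCK_PERIOD_NS = 21276.0

# Setup slack values copied from Table 1.
SETUP_SLACK_NS = {
    "FIR Pipeline": 21170.4,
    "L=2 Parallel Processing": 21168.154,
    "L=3 Parallel Processing": 21166.473,
    "Combined Pipelined & L=3": 21165.096,
}

def clock_frequency_khz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in kilohertz."""
    return 1.0e6 / period_ns

def estimated_min_period_ns(period_ns, slack_ns):
    """First-order estimate: the part of the period actually used by the worst path."""
    return period_ns - slack_ns

print("constraint frequency: %.2f kHz" % clock_frequency_khz(CLOCK_PERIOD_NS))
for name, slack in SETUP_SLACK_NS.items():
    used = estimated_min_period_ns(CLOCK_PERIOD_NS, slack)
    print("%-26s worst path uses about %.3f ns" % (name, used))
```

Under this estimate every design uses on the order of 100 ns of a roughly 21 µs period, so the 47 kHz constraint does not differentiate the architectures in terms of speed.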

#### **4.4 Power Estimation**

Power estimation was evaluated under typical operating conditions at a 1.1 V supply. The *fir\_pipeline* design demonstrated negligible dynamic power (approximately 0 μW) with a leakage power of around 8.22 μW. In the more advanced architectures, the *FIR\_Filter\_L3\_Pipelined\_Top* design reported a total dynamic power of approximately 246.13 μW and leakage power of 105.16 μW, resulting in a total power consumption of about 351.29 μW. Similarly, the *FIR\_Filter\_L2\_Top* design consumed roughly 12.43 μW of dynamic power and 17.08 μW of leakage power, totaling around 29.51 μW, while the *FIR\_Filter\_L3\_Top* design exhibited 83.81 μW dynamic power and 42.62 μW leakage power, amounting to a total power of approximately 126.43 μW.

### **4.5 Summary Table**

The following table summarizes the key synthesis metrics, including area, power (dynamic, leakage, and total), and other pertinent data across the various FIR filter architectures.

| Design                      | Ports | Nets | Total Cells | Comb. Cells | Seq. Cells | Total Cell Area | Dynamic Power | Leakage Power | Total Power |
|-----------------------------|-------|------|-------------|-------------|------------|-----------------|---------------|---------------|-------------|
| fir_pipeline                | 276   | 387  | 258         | 64          | 64         | 810.95          | ~0 μW         | 8.22 μW       | 8.22 μW     |
| FIR_Filter_L2_Top           | 652   | 820  | 601         | 145         | 128        | 1762.69         | 12.43 μW      | 17.08 μW      | 29.51 μW    |
| FIR_Filter_L3_Top           | 1660  | 2596 | 1611        | 692         | 256        | 5624.09         | 83.81 μW      | 42.62 μW      | 126.43 μW   |
| FIR_Filter_L3_Pipelined_Top | 3600  | 6856 | 3447        | 1743        | 640        | 13831.21        | 246.13 μW     | 105.16 μW     | 351.29 μW   |

Table 2: Summary of Synthesis Metrics for FIR Filter Architectures
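To put the table in relative terms, the Python sketch below normalizes area and total power to the fir\_pipeline baseline and checks that dynamic plus leakage power matches the reported totals. The numbers are copied from Table 2; the choice of baseline and the treatment of the "~0 μW" dynamic power entry as exactly zero are assumptions made for this illustration.

```python
# (cell area, dynamic power in uW, leakage power in uW, total power in uW) from Table 2.
METRICS = {
    "fir_pipeline": (810.95, 0.0, 8.22, 8.22),   # "~0 uW" dynamic taken as 0.0
    "FIR_Filter_L2_Top": (1762.69, 12.43, 17.08, 29.51),
    "FIR_Filter_L3_Top": (5624.09, 83.81, 42.62, 126.43),
    "FIR_Filter_L3_Pipelined_Top": (13831.21, 246.13, 105.16, 351.29),
}

def relative_to_baseline(metrics, baseline="fir_pipeline"):
    """Return {design: (area ratio, total power ratio)} against the baseline design."""
    base_area, _, _, base_power = metrics[baseline]
    return {name: (area / base_area, total / base_power)
            for name, (area, _, _, total) in metrics.items()}

def totals_consistent(metrics, tolerance=0.005):
    """Check that dynamic + leakage equals the reported total for every design."""
    return all(abs(dyn + leak - total) <= tolerance
               for _, dyn, leak, total in metrics.values())

ratios = relative_to_baseline(METRICS)
for name, (area_ratio, power_ratio) in ratios.items():
    print("%-28s area x%.2f  power x%.2f" % (name, area_ratio, power_ratio))
```

Ratios like these only compare the synthesized netlists at the shared 47 kHz constraint; they say nothing about how the designs would compare at their respective maximum clock rates.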

### **4.6 Filter Simulation Results**



Figure 6.1: FIR Pipelined Testbench Results



Figure 6.2: Reduced-complexity parallel processing L=2 Testbench Results



Figure 6.3: Reduced-complexity parallel processing L=3 Testbench Results



Figure 6.4: Reduced-complexity parallel processing L=3 and Pipelined Testbench Results

# 5. Further Analysis and Conclusion

### **5.1 Performance vs. Complexity**

The project illustrates the trade-off between performance and complexity that pipelining and parallel processing introduce. The L=3 parallel architectures raise throughput by handling three input samples at a time, and the combined design (*FIR\_Filter\_L3\_Pipelined\_Top*) additionally pipelines each sub-filter. These benefits come at the cost of the largest cell area (13,831.21 units) and the highest total power (351.29 µW) of the four designs, as well as increased design complexity and the additional logic required to manage data dependencies. On the other hand, the simpler pipelined design (*fir\_pipeline*) maintained a minimal power footprint (about 8.22 µW, essentially all leakage) while offering satisfactory performance. This highlights the trade-offs between improved speed and resource utilization.

### **5.2 Quantization Impact**

The quantization strategy employed in this design, using a signed 32-bit representation for the filter coefficients, was highly effective. Analysis confirmed that this level of precision closely matched the floating-point frequency response, ensuring that quantization errors remained minimal. Furthermore, the implemented techniques for scaling and overflow management successfully prevented arithmetic overflow during multiply-accumulate (MAC) operations, preserving the overall filter performance.

#### 5.3 Scalability

The modular approach used in the Verilog implementation demonstrated excellent scalability. This structure enables easy integration of additional filter taps and further parallelization with minimal modifications. The ability to scale the design efficiently ensures adaptability to a variety of applications, making it a practical and reusable solution.

#### **5.4 Conclusion**

This project successfully demonstrates the design and implementation of a low-pass FIR filter that meets stringent architectural requirements. The integration of MATLAB for initial filter design and noise simulation, along with a robust and scalable Verilog implementation optimized through pipelining and parallel processing, provides valuable insights into practical design tradeoffs. Detailed synthesis metrics—including area utilization, power consumption, and timing analysis—highlight the effectiveness of the chosen design strategies and confirm that the design meets its performance targets while maintaining reliable operation.